Wine quality classification

The purpose of this project is to develop an in-depth, fully Bayesian analysis to model wine quality based on physicochemical tests such as:

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

The output variable is the quality, good or bad, based on these physicochemical properties. The dataset is freely available on the UCI Machine Learning Repository, one of the largest collections of machine learning problems, maintained by the Center for Machine Learning and Intelligent Systems at the University of California.

Features Analysis

As a first step, let's divide the dataset into train and test sets with proportions of \(80\%\) and \(20\%\) respectively (for evaluating the performance of our model at the end), and let's see how the features are distributed, along with some other characteristics of the dataset.
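A split along these lines can be written as follows (a sketch: `dataset` is the loaded data frame, and the seed is an arbitrary choice for reproducibility):

```r
set.seed(123)  # arbitrary seed, only for reproducibility
n <- nrow(dataset)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))  # 80% of the rows
dataset_train <- dataset[train_idx, ]
dataset_test  <- dataset[-train_idx, ]
```

Sampling the row indices without replacement keeps every observation in exactly one of the two sets.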

To avoid displaying all the features, here I report only the best-known ones; for the others we will see only a brief summary.

Fixed acidity

Fixed acidity is an important feature of wines. Acids are fundamental for wine conservation; for example, a long-lived wine suitable for aging must have a relatively high fixed acidity. In particular, fixed acids give a feeling of freshness in the mouth, which must be balanced by the sensations of warmth and softness induced by alcohol and sugars. Based on acidity, a wine can be:

  • flat wine (low acidity)
  • quite fresh wine
  • fresh wine
  • sour wine (high acidity)

pH

The pH of wine is a measure of the strength and concentration of the dissociated acids present in that medium. It is calculated from the concentration of hydrogen ions via \(pH = -\log_{10}[H^+]\) and can be adjusted through the addition of acid or base. It plays a critical role in many aspects of winemaking, in particular wine stability and red wine colour. For wine it is usually in the range \([2.5, 4]\).

Alcohol

The alcohol content of wine varies between 4% (in some sweet sparkling wines) and 20% or more (in liqueur wines). Wines with an alcohol content of up to 10% are generally defined as “light,” whereas wines with an increasing alcohol content are often defined as more or less “hot.” In our dataset most of the wines are “light” (alcohol \(\leq 10\%\)) or “quite hot” (alcohol between \(11\%\) and \(12\%\)); there are only a few “strongly alcoholic” wines (alcohol \(\geq 15\%\)).

Sulfites

Sulfites are a food preservative widely used in winemaking, thanks to their ability to maintain the flavor and freshness of wine. Thanks to their antimicrobial properties, these compounds can prevent bacterial growth and prolong the shelf life of wines and other products. Moreover, they improve taste and appearance and enhance flavor. Fewer sulfites are generally better, but a typical red wine has around \(50\text{-}100\ mg/L\) of them.

Dataset summary

Here I report the main statistics of all the features.

| Feature | Min | Max | Median | Mean | Standard Dev |
|---|---|---|---|---|---|
| Fixed acidity | 4.7 | 15.9 | 7.9 | 8.339 | 1.73 |
| Volatile acidity | 0.12 | 1.58 | 0.52 | 0.528 | 0.181 |
| Citric acid | 0 | 1 | 0.26 | 0.271 | 0.194 |
| Residual sugar | 0.9 | 15.4 | 2.2 | 2.531 | 1.381 |
| Chlorides | 0.012 | 0.611 | 0.079 | 0.088 | 0.048 |
| Free sulfur dioxide | 1 | 72 | 14 | 16.021 | 10.538 |
| Total sulfur dioxide | 6 | 289 | 38 | 46.711 | 32.993 |
| Density | 0.99007 | 1.00369 | 0.99676 | 0.997 | 0.002 |
| pH | 2.74 | 4.01 | 3.31 | 3.311 | 0.156 |
| Sulphates | 0.33 | 2 | 0.62 | 0.657 | 0.166 |
| Alcohol | 8.4 | 14.9 | 10.1 | 10.422 | 1.07 |

Output distribution

The number of wines that have been tested in total is 1599, which implies that the number of observations in the training set is 1279. As we can see, it is almost perfectly balanced, so we don't need an oversampling method to increase the data of the minority class.

Observation on relevance of the features

At first glance we look for a feature capable of perfectly dividing the two classes. Unfortunately, there isn't a feature for which wines belonging to different classes take clearly different values; the values are evenly distributed between the classes. Only for alcohol and for total sulfur dioxide do we see a slight separation.

Bad wines seem to have a small amount of alcohol, and wines with a high amount of alcohol are more likely to be good wines.

Free sulfur dioxide helps to protect wine from oxidation and spoilage, but, as we can see, wines with a high amount of this molecule are likely to be classified as bad. In fact, an excessive quantity of free \(SO_2\) can be perceptible to consumers, by masking the wine's own fruit aromas and by contributing a sharp, bitter, metallic, chemical flavor or sensation.

Correlation matrix

Let's compute the Pearson correlation coefficient between all pairs of features and see if there are strongly correlated features.

As we might expect, some features are strongly correlated, such as fixed acidity and pH: the greater the presence of acids, the lower the pH (think of the pH formula and its logarithmic scale). The same holds for density and alcohol content: the greater the quantity of ethanol, the lower the density of the wine.

Bayesian Logistic Regression

In our set-up the response variable, wine quality, can be modeled with a Bernoulli distribution: \[ Y_i \in \{0,1\}, \quad \text{where } 0 \text{ is bad and } 1 \text{ is good}. \] The Bernoulli distribution belongs to the exponential family, which is why we can use a Generalized Linear Model to analyze the response variable. The only things we need to define now are the link function and the priors on the coefficients.

Using all the explanatory variables we have, the Generalized Linear Model is:

\[ Y_i\sim Bernoulli(p_i)\\ logit(p_i)=\log\left(\frac{p_i}{1-p_i}\right)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i3}+\beta_4 x_{i4}+\beta_5 x_{i5}+\beta_6 x_{i6}+\beta_7 x_{i7}+\beta_8 x_{i8}+\beta_9 x_{i9}+\beta_{10} x_{i10}+\beta_{11} x_{i11}\\ \beta_i\sim Normal\left(\mu=0,\tau^2=\frac{1}{10^{6}}\right)\qquad \text{for } i \in \{0,1,\dots,11\} \] Model coefficients \(\beta_i\) are typically given Normal priors because they can take any real value, and the support of the Normal is exactly \((-\infty,+\infty)\).

JAGS implementation

model <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i] + beta7*x7[i] + beta8*x8[i] + beta9*x9[i] + beta10*x10[i] + beta11*x11[i]
  }
  
  # Defining the prior beta parameters (note: dnorm in JAGS takes mean and precision)
  beta0 ~ dnorm(0, 1.0E-6)
  beta1 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta3 ~ dnorm(0, 1.0E-6)
  beta4 ~ dnorm(0, 1.0E-6)
  beta5 ~ dnorm(0, 1.0E-6)
  beta6 ~ dnorm(0, 1.0E-6)
  beta7 ~ dnorm(0, 1.0E-6)
  beta8 ~ dnorm(0, 1.0E-6)
  beta9 ~ dnorm(0, 1.0E-6)
  beta10 ~ dnorm(0, 1.0E-6)
  beta11 ~ dnorm(0, 1.0E-6)
}

Summary table of \(\beta\) coefficients

Running JAGS with 3 chains, 10000 iterations, a burn-in of 1000 steps, and a thinning interval of 10, we obtain:
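With the R2jags package such a run can be launched roughly as follows (a sketch: the data list assumes the response `y` and predictors `x1`–`x11` have already been extracted from the training set):

```r
library(R2jags)

jags_data <- list(N = nrow(dataset_train), y = y,
                  x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = x5, x6 = x6,
                  x7 = x7, x8 = x8, x9 = x9, x10 = x10, x11 = x11)
params <- paste0("beta", 0:11)  # monitor all the coefficients

mod.fit <- jags(data = jags_data, parameters.to.save = params,
                model.file = model,          # the model function defined above
                n.chains = 3, n.iter = 10000,
                n.burnin = 1000, n.thin = 10)
print(mod.fit)  # summary table with mean, sd, quantiles, Rhat, n.eff
```

`R2jags::jags` accepts the model as a plain R function, which is why the model above is written that way.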

| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | 50.0141 | 88.1723 | -121.2143 | -10.2418 | 49.7109 | 108.9489 | 222.7448 | 1.0019 | 2700 |
| beta1 | 0.1539 | 0.1113 | -0.0688 | 0.0786 | 0.1534 | 0.2313 | 0.3726 | 1.0012 | 2700 |
| beta10 | 3.2509 | 0.5371 | 2.2244 | 2.8857 | 3.2430 | 3.6082 | 4.3148 | 1.0006 | 2700 |
| beta11 | 0.8822 | 0.1155 | 0.6629 | 0.8034 | 0.8817 | 0.9607 | 1.1073 | 1.0019 | 1400 |
| beta2 | -3.6282 | 0.5410 | -4.7027 | -3.9973 | -3.6057 | -3.2665 | -2.5934 | 1.0005 | 2700 |
| beta3 | -1.6651 | 0.6367 | -2.8967 | -2.0901 | -1.6795 | -1.2334 | -0.4021 | 1.0009 | 2700 |
| beta4 | 0.0934 | 0.0623 | -0.0326 | 0.0528 | 0.0940 | 0.1344 | 0.2186 | 1.0019 | 1400 |
| beta5 | -4.9134 | 1.8238 | -8.5378 | -6.1146 | -4.9182 | -3.6971 | -1.3180 | 1.0006 | 2700 |
| beta6 | 0.0263 | 0.0093 | 0.0082 | 0.0200 | 0.0264 | 0.0325 | 0.0443 | 1.0023 | 1100 |
| beta7 | -0.0178 | 0.0033 | -0.0243 | -0.0200 | -0.0178 | -0.0156 | -0.0114 | 1.0014 | 2100 |
| beta8 | -58.4842 | 89.9910 | -233.5058 | -119.0697 | -58.0267 | 2.8774 | 116.1996 | 1.0019 | 2700 |
| beta9 | -0.3254 | 0.7955 | -1.9092 | -0.8483 | -0.3179 | 0.2116 | 1.1950 | 1.0009 | 2700 |
| deviance | 1309.8859 | 5.0944 | 1302.2275 | 1306.3665 | 1309.2979 | 1312.7002 | 1320.7248 | 1.0027 | 880 |

The DIC of the model is 1322.8427392.

The point estimates for \(\beta_i\) are shown in the “mean” column. The \(95\%\) credible intervals for \(\beta_0,\beta_1,\beta_4,\beta_8,\beta_9\) contain 0, which means that the variables \(x_0\) (intercept), \(x_1\) (fixed acidity), \(x_4\) (residual sugar), \(x_8\) (density), and \(x_9\) (pH) may be statistically insignificant; we can try to remove them and see if the DIC decreases. The other diagnostics also look good:

  • Rhat (the potential scale reduction factor) is a measure of the convergence of the chains: it indicates whether all the chains converge to the same value. In particular, it compares the between-chain variance with the within-chain variance; if they are similar, the chains come from the same distribution. The value of Rhat should lie between 1 and 1.1.

  • n.eff indicates the effective sample size, which can be interpreted as the number of independent Monte Carlo samples needed to give the same precision as the MCMC samples. The greater this value, the lower the autocorrelation between the MCMC steps and the better the final approximation. Let's make a remark on the quality of an MCMC approximation: it derives from the variance of the MCMC we have performed, which equals the plain Monte Carlo variance plus a term that depends on the correlation of the samples within the Markov chain. Generally this term is positive, so we expect an approximation further from the plain Monte Carlo one. The formula is: \[ Var_{MCMC}[\hat{I}]=\frac{Var[\hat{I}]}{S_{eff}}=\left(1+2\sum_{k=1}^{\infty}\rho_k\right)\frac{\sigma^2}{t} \] where \(\hat{I}\) is the mean of the parameter we are interested in.

  • DIC (deviance information criterion) is particularly useful in Bayesian model selection, because it measures the goodness-of-fit of the model. There isn't a standard absolute scale for DIC; it can take any positive value and is equal to \(DIC=D_{\hat{\theta}}(y)+2p_D\), where \(D_{\hat{\theta}}(y)\) is the deviance of the model (\(-2\log(\text{likelihood})\)) and \(p_D\) is the number of effective (“unconstrained”) parameters in the model. Since \(D_{\hat{\theta}}(y)\) decreases as the number of parameters in a model increases, the \(p_D\) term compensates for this effect by penalizing models with a larger number of parameters. Models with smaller DIC should be preferred to models with larger DIC.
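In practice both n.eff and the autocorrelations can be computed directly with the coda package (a sketch, assuming `mod.fit` is the fitted R2jags object from above):

```r
library(coda)

chains <- as.mcmc(mod.fit)   # convert the R2jags fit to an mcmc.list (one mcmc per chain)
effectiveSize(chains)        # effective sample size (n.eff) for each monitored parameter
autocorr.diag(chains)        # autocorrelation of each parameter at selected lags
```

Low autocorrelation at lag 1 and an effective size close to the number of stored draws indicate nearly independent samples.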

Frequentist Logistic Regression

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 49.5255 | 90.4013 | 0.5478 | 0.5838 |
| x1 | 0.1528 | 0.1118 | 1.3670 | 0.1716 |
| x2 | -3.5947 | 0.5491 | -6.5464 | 0.0000 |
| x3 | -1.6427 | 0.6314 | -2.6015 | 0.0093 |
| x4 | 0.0933 | 0.0617 | 1.5105 | 0.1309 |
| x5 | -4.7436 | 1.7955 | -2.6419 | 0.0082 |
| x6 | 0.0259 | 0.0093 | 2.7960 | 0.0052 |
| x7 | -0.0176 | 0.0033 | -5.3914 | 0.0000 |
| x8 | -57.9923 | 92.2671 | -0.6285 | 0.5297 |
| x9 | -0.3050 | 0.8017 | -0.3804 | 0.7036 |
| x10 | 3.1833 | 0.5360 | 5.9389 | 0.0000 |
| x11 | 0.8763 | 0.1189 | 7.3721 | 0.0000 |

As we can see from this summary, the estimated coefficients obtained by the frequentist approach are very similar to the ones obtained with JAGS. The AIC score is also similar: 1321.8358064 (we are comparing AIC and DIC because, in models with negligible prior information, DIC is approximately equivalent to AIC). These results indicate that x0 (intercept), x1 (fixed acidity), x4 (residual sugar), x8 (density), and x9 (pH) may not be informative enough for the model.
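The frequentist fit behind this table is a standard `glm` call (a sketch, assuming the response `y` and predictors `x1`–`x11` are the same vectors passed to JAGS):

```r
# maximum-likelihood logistic regression on the training data
freq_fit <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11,
                family = binomial(link = "logit"))
summary(freq_fit)  # coefficient table, z values, p-values, and AIC
```

`family = binomial(link = "logit")` matches the Bernoulli likelihood with logit link used in the Bayesian model.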

Let's try to remove them and see the results.

Bayesian Logistic Regression with features selection

As said above, we can try to eliminate some features from the model and see what happens. After some trials with different combinations of the features that initially seemed the least informative (intercept, x1, x4, x8, x9), I found that the features that really compromise the convergence tests we will see later are the intercept, x4, and x9.

model_filtered <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta5*x5[i] + beta6*x6[i] + beta7*x7[i] + beta8*x8[i] + beta10*x10[i] + beta11*x11[i]
  }
  
  # Defining the prior beta parameters
  beta1 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta3 ~ dnorm(0, 1.0E-6)
  beta5 ~ dnorm(0, 1.0E-6)
  beta6 ~ dnorm(0, 1.0E-6)
  beta7 ~ dnorm(0, 1.0E-6)
  beta8 ~ dnorm(0, 1.0E-6)
  beta10 ~ dnorm(0, 1.0E-6)
  beta11 ~ dnorm(0, 1.0E-6)
}
The summary of the filtered model is:

| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1 | 0.1440 | 0.0577 | 0.0350 | 0.1040 | 0.1436 | 0.1822 | 0.2576 | 1.0006 | 2700 |
| beta10 | 3.0581 | 0.5199 | 2.0241 | 2.7100 | 3.0569 | 3.4058 | 4.0766 | 1.0008 | 2700 |
| beta11 | 0.9270 | 0.0844 | 0.7646 | 0.8680 | 0.9281 | 0.9832 | 1.0902 | 1.0007 | 2700 |
| beta2 | -3.6798 | 0.5422 | -4.7240 | -4.0426 | -3.6825 | -3.3305 | -2.6397 | 1.0006 | 2700 |
| beta3 | -1.5937 | 0.6290 | -2.8560 | -2.0111 | -1.5898 | -1.1760 | -0.3679 | 1.0009 | 2700 |
| beta5 | -4.3940 | 1.7551 | -7.9050 | -5.5461 | -4.3758 | -3.2014 | -1.0099 | 1.0006 | 2700 |
| beta6 | 0.0272 | 0.0093 | 0.0086 | 0.0208 | 0.0274 | 0.0338 | 0.0454 | 1.0006 | 2700 |
| beta7 | -0.0168 | 0.0032 | -0.0232 | -0.0190 | -0.0168 | -0.0147 | -0.0108 | 1.0007 | 2700 |
| beta8 | -9.5118 | 1.0990 | -11.6672 | -10.2491 | -9.5095 | -8.7763 | -7.4051 | 1.0010 | 2700 |
| deviance | 1310.1463 | 4.3922 | 1303.7774 | 1307.0177 | 1309.4741 | 1312.4607 | 1320.8828 | 1.0024 | 1500 |

Moreover, removing them slightly reduces the DIC to 1319.7866759.

MCMC plots

Let’s see the plots of the Markov chain, the density distribution of the values along the chain and the autocorrelation between these values.

As we can see, all the Markov chains remain in the steady state, and the autocorrelation between consecutive values of the chain is very small. A Markov chain with high autocorrelation moves around the parameter space slowly, taking a long time to achieve the correct balance among the different regions of the parameter space. An independent Monte Carlo sampler has zero autocorrelation and can jump between different regions of the parameter space in one step. Moreover, the higher the autocorrelation in the chain, the larger the MCMC variance and the worse the approximation of the quantity of interest.

Model checking diagnostics

Geweke Diagnostic

The Geweke diagnostic takes two nonoverlapping parts of the Markov chain (usually the first 0.1 and last 0.5 proportions, after the burn-in) and compares their means, using a difference-of-means test to see whether the two parts of the chain come from the same distribution (the null hypothesis). If the two means are statistically equal, the chain has probably converged.

The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error, adjusted for autocorrelation.
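With coda the diagnostic is a single call (a sketch, assuming `chains` is the mcmc.list obtained from the fitted R2jags object, e.g. via `as.mcmc`):

```r
library(coda)

# Z-scores comparing the first 10% and last 50% of each chain, per parameter
geweke.diag(chains, frac1 = 0.1, frac2 = 0.5)

# Z-scores recomputed as the start of the chain is progressively discarded
geweke.plot(chains)
```

Values inside roughly \(\pm 1.96\) are consistent with the two segments sharing the same mean at the 5% level.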

| | beta1 | beta10 | beta11 | beta2 | beta3 | beta5 | beta6 | beta7 | beta8 | deviance |
|---|---|---|---|---|---|---|---|---|---|---|
| Z-score chain 1 | -1.8075 | -0.3422 | -0.8420 | -0.4038 | 0.1721 | 1.0140 | 1.5755 | -2.6680 | 1.7418 | 0.1702 |
| Z-score chain 2 | 1.5878 | 0.8119 | -0.1599 | 1.2546 | -1.0915 | 0.2955 | -0.6954 | 2.8194 | -1.4308 | 0.4184 |
| Z-score chain 3 | -1.3249 | 1.4023 | 0.5777 | 0.5500 | 1.6183 | -0.5751 | 0.1580 | -0.9854 | -0.5538 | 0.5673 |

As we can see from the plots of the Z-scores of the first chain, the majority of the values lie inside the interval \((-1.96,+1.96)\), so we almost always accept the equality of the means.

Heidelberg and Welch Diagnostic

The Heidelberg and Welch diagnostic calculates a test statistic to accept or reject the null hypothesis that the Markov chain is from a stationary distribution.

The diagnostic consists of two parts:

First Part:

  1. Generate a chain of N iterations and define an α level.
  2. Calculate the test statistic on the whole chain. Accept or reject null hypothesis that the chain is from a stationary distribution.
  3. If null hypothesis is rejected, discard the first 10% of the chain. Calculate the test statistic and accept or reject null.
  4. If null hypothesis is rejected, discard the next 10% and calculate the test statistic.
  5. Repeat until null hypothesis is accepted or 50% of the chain is discarded. If test still rejects null hypothesis, then the chain fails the test and needs to be run longer.

Second Part:

If the chain passes the first part of the diagnostic, then it takes the part of the chain not discarded from the first part to test the second part. The halfwidth test calculates half the width of the (1 − α)% credible interval around the mean. If the ratio of the halfwidth and the mean is lower than some \(\epsilon\), then the chain passes the test. Otherwise, the chain must be run out longer.
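Both parts of the diagnostic are implemented in coda (a sketch, assuming `chains` is the mcmc.list from the fitted model; `eps` is the \(\epsilon\) tolerance of the halfwidth test):

```r
library(coda)

# stationarity test (part one) and halfwidth test (part two) for each parameter
heidel.diag(chains, eps = 0.1, pvalue = 0.05)
```

The output reports, per chain, whether each test passed, the start iteration retained, the p-value, the mean, and the halfwidth.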

| | Stationarity test | Start iteration | p-value | Halfwidth test | Mean | Halfwidth |
|---|---|---|---|---|---|---|
| beta1 | passed | 91 | 0.0990671028544442 | passed | 0.1457135324004 | 0.00447781714679559 |
| beta10 | passed | 1 | 0.099827788326266 | passed | 3.07010252584141 | 0.0336063960230943 |
| beta11 | passed | 1 | 0.244047137194526 | passed | 0.928647631709531 | 0.00556054821071862 |
| beta2 | passed | 1 | 0.301962882265881 | passed | -3.67417173451653 | 0.0355291030712749 |
| beta3 | passed | 1 | 0.425817122088649 | passed | -1.60782790581219 | 0.0412671722581498 |
| beta5 | passed | 1 | 0.362788675901325 | passed | -4.42365049970399 | 0.118811595962462 |
| beta6 | passed | 1 | 0.0819662655631511 | passed | 0.0271198355612989 | 0.000693497083399823 |
| beta7 | passed | 91 | 0.376982426219757 | passed | -0.016679697105009 | 0.000253341646458508 |
| beta8 | passed | 1 | 0.363482016982527 | passed | -9.54148548499899 | 0.0723475940866602 |
| deviance | passed | 1 | 0.576438582879337 | passed | 1310.3602651859 | 0.302905368426096 |

All tests pass, so we can say that the Markov chain reaches the stationary distribution and that the accuracy of our estimates is good enough (the halfwidths of the intervals are small enough).

Gelman and Rubin Multiple Sequence Diagnostic

Steps (for each parameter):

  1. Run m ≥ 2 chains of length 2n from overdispersed starting values.
  2. Discard the first n draws in each chain.
  3. Calculate the within-chain and between-chain variance.
  4. Calculate the estimated variance of the parameter as a weighted sum of the within-chain and between-chain variance.
  5. Calculate the potential scale reduction factor.

Within Chain Variance

\[ W=\frac{1}{m}\sum_{j=1}^m s_j^2 \] where \[ s_j^2=\frac{1}{n-1}\sum_{i=1}^n(\theta_{ij}-\bar{\theta}_j)^2 \] \(s_j^2\) is just the variance of the \(j\)th chain, so \(W\) is the mean of the variances of the chains.

Between Chain Variance

\[ B=\frac{n}{m-1}\sum_{j=1}^m(\bar{\theta_j}-\bar{\bar{\theta}})^2 \] where \[ \bar{\bar{\theta}}=\frac{1}{m}\sum_{j=1}^{m}\bar{\theta_j} \]

This is the variance of the chain means multiplied by n because each chain is based on n draws.

Estimated Variance

We can then estimate the variance of the stationary distribution as a weighted average of W and B.

\[ \hat{Var(\theta)}=\left(1-\frac{1}{n}\right)W+\frac{1}{n}B \]

Because of overdispersion of the starting values, this overestimates the true variance, but is unbiased if the starting distribution equals the stationary distribution (if starting values were not overdispersed).

Potential Scale Reduction Factor

The potential scale reduction factor is \[ \hat{R}=\sqrt{\frac{\hat{Var(\theta)}}{W}} \]

When \(\hat{R}\) is high (perhaps greater than 1.1 or 1.2), then we should run our chains out longer to improve convergence to the stationary distribution.
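The formulas above can be computed by hand and checked against coda's implementation (a sketch: `draws` is an assumed iterations-by-chains matrix for one parameter, and `chains` the mcmc.list from the fitted model):

```r
# potential scale reduction factor for one parameter,
# following the W, B and Var-hat formulas above
rhat <- function(draws) {                 # draws: n iterations x m chains
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))         # within-chain variance (mean of s_j^2)
  B <- n * var(colMeans(draws))           # between-chain variance
  var_hat <- (1 - 1/n) * W + B/n          # estimated variance of the parameter
  sqrt(var_hat / W)                       # potential scale reduction factor
}

library(coda)
gelman.diag(chains)   # point estimates and upper C.I. for every parameter
```

Note that `var(colMeans(draws))` already divides by \(m-1\), which is why the \(n/(m-1)\) factor reduces to a multiplication by \(n\).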

| | Point est. | Upper C.I. |
|---|---|---|
| beta1 | 1.0020796 | 1.009859 |
| beta10 | 0.9996509 | 1.001040 |
| beta11 | 1.0000762 | 1.000248 |
| beta2 | 1.0003867 | 1.002551 |
| beta3 | 1.0009550 | 1.006106 |
| beta5 | 1.0013559 | 1.005364 |
| beta6 | 1.0027693 | 1.010340 |
| beta7 | 1.0026834 | 1.011983 |
| beta8 | 1.0003802 | 1.003374 |
| deviance | 1.0055060 | 1.015004 |

All the values are near \(1\), which indicates that 10000 iterations are enough for convergence to the stationary distribution.

Validate model: Simulation

To check whether the MCMC implementation is correct, we can run it on simulated data and verify that it recovers the model parameters we fixed. The steps are the following:

  1. Fix the sample size; in this case I chose the same number of observations we have initially.

  2. Generate data from the data we have: compute the ECDF and then sample from it (sampling from the empirical cumulative distribution function is the same as resampling with replacement from the sample). This is just bootstrapping.

  3. Fix the model parameters; here I have chosen the ones obtained by the previous model.

  4. Compute the linear predictor and the sigmoid value (inverse logit function).

  5. Generate the output as a \(Bernoulli\) r.v. with probability equal to the value of the sigmoid.

  6. Run JAGS and see if it is able to recover the parameters that generated the data.

#sample size
N<-nrow(dataset)

#simulated data 
x1_simulation<-sample(x1,N,replace=TRUE)
x2_simulation<-sample(x2,N,replace=TRUE)
x3_simulation<-sample(x3,N,replace=TRUE)
x5_simulation<-sample(x5,N,replace=TRUE)
x6_simulation<-sample(x6,N,replace=TRUE)
x7_simulation<-sample(x7,N,replace=TRUE)
x8_simulation<-sample(x8,N,replace=TRUE)
x10_simulation<-sample(x10,N,replace=TRUE)
x11_simulation<-sample(x11,N,replace=TRUE)


#fixed model parameters
beta_1_sim<-mod.fit_filtered$BUGSoutput$summary[1,1]
beta_2_sim<-mod.fit_filtered$BUGSoutput$summary[4,1]
beta_3_sim<-mod.fit_filtered$BUGSoutput$summary[5,1]    
beta_5_sim<-mod.fit_filtered$BUGSoutput$summary[6,1]
beta_6_sim<-mod.fit_filtered$BUGSoutput$summary[7,1]
beta_7_sim<-mod.fit_filtered$BUGSoutput$summary[8,1]
beta_8_sim<-mod.fit_filtered$BUGSoutput$summary[9,1]
beta_10_sim<-mod.fit_filtered$BUGSoutput$summary[2,1]
beta_11_sim<-mod.fit_filtered$BUGSoutput$summary[3,1]


#linear predictor
linpred <- beta_1_sim*x1_simulation+beta_2_sim*x2_simulation + beta_3_sim*x3_simulation +beta_5_sim*x5_simulation + beta_6_sim*x6_simulation + beta_7_sim*x7_simulation+ beta_8_sim*x8_simulation+beta_10_sim*x10_simulation+beta_11_sim*x11_simulation

# probability for bernoulli 
pis <- exp(linpred)/(1+exp(linpred))

#simulated output
y_simulation <- rbinom(nrow(dataset), 1, pis)
model_filtered_simulation<- function(){
  # Likelihood
  for (i in 1:N){
    y_simulation[i] ~ dbern(p[i])
    logit(p[i]) <- beta1*x1_simulation[i] + beta2*x2_simulation[i] +  beta3*x3_simulation[i]  + beta5*x5_simulation[i] + beta6*x6_simulation[i] + beta7*x7_simulation[i] + beta8*x8_simulation[i]  + beta10*x10_simulation[i] + beta11*x11_simulation[i]
  }
  
  # Defining the prior beta parameters
  beta1 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta3 ~ dnorm(0, 1.0E-6)
  beta5 ~ dnorm(0, 1.0E-6)
  beta6 ~ dnorm(0, 1.0E-6)
  beta7 ~ dnorm(0, 1.0E-6)
  beta8 ~ dnorm(0, 1.0E-6)
  beta10 ~ dnorm(0, 1.0E-6)
  beta11 ~ dnorm(0, 1.0E-6)
}
The summary of the fit on the simulated data is:

| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1 | 0.1153 | 0.1681 | -0.2075 | 0.0021 | 0.1164 | 0.2261 | 0.4548 | 1.0017 | 1500 |
| beta10 | 3.2007 | 1.6388 | 0.1839 | 2.0332 | 3.1066 | 4.3070 | 6.5574 | 1.0106 | 200 |
| beta11 | 0.3957 | 0.2699 | -0.0991 | 0.2117 | 0.3817 | 0.5798 | 0.9473 | 1.0101 | 210 |
| beta2 | -3.4852 | 1.7046 | -6.9256 | -4.5792 | -3.4031 | -2.3056 | -0.3239 | 1.0078 | 270 |
| beta3 | -1.2256 | 1.1792 | -3.6053 | -1.9988 | -1.1916 | -0.3955 | 0.9608 | 1.0007 | 2400 |
| beta5 | -6.7605 | 7.9983 | -24.2601 | -11.7522 | -6.0978 | -1.1951 | 7.4149 | 1.0067 | 390 |
| beta6 | 0.3194 | 0.0821 | 0.1530 | 0.2594 | 0.3210 | 0.3877 | 0.4558 | 1.0223 | 100 |
| beta7 | -0.5004 | 0.1573 | -0.7414 | -0.6440 | -0.5105 | -0.3898 | -0.1730 | 1.0384 | 65 |
| beta8 | -5.7586 | 3.3962 | -12.4964 | -7.9482 | -5.6193 | -3.3947 | 0.4208 | 1.0071 | 360 |
| deviance | 100.2004 | 22.8993 | 84.1825 | 89.7244 | 94.8615 | 103.5644 | 147.8861 | 1.0099 | 220 |

As we can see, most of the values recovered by JAGS are compatible with the ones I fixed, which suggests that our MCMC implementation is correct.

Posterior inference

We have checked that all the chains pass the convergence tests, so we can gather them and make a more accurate estimation of the parameters of interest. The principal statistics I decided to compute are the posterior point estimates, the equi-tailed credible intervals, and the highest posterior density (HPD) intervals.

| | Point estimate | Quantile interval left | Quantile interval right | HPD interval left | HPD interval right |
|---|---|---|---|---|---|
| beta1 | 0.1440 | 0.0350 | 0.2576 | 0.0362 | 0.2580 |
| beta10 | 3.0581 | 2.0241 | 4.0766 | 2.0565 | 4.1054 |
| beta11 | 0.9270 | 0.7646 | 1.0902 | 0.7523 | 1.0771 |
| beta2 | -3.6798 | -4.7240 | -2.6397 | -4.7152 | -2.6308 |
| beta3 | -1.5937 | -2.8560 | -0.3679 | -2.8228 | -0.3437 |
| beta5 | -4.3940 | -7.9050 | -1.0099 | -7.8481 | -0.9458 |
| beta6 | 0.0272 | 0.0086 | 0.0454 | 0.0076 | 0.0437 |
| beta7 | -0.0168 | -0.0232 | -0.0108 | -0.0227 | -0.0104 |
| beta8 | -9.5118 | -11.6672 | -7.4051 | -11.6077 | -7.3635 |
| deviance | 1310.1463 | 1303.7774 | 1320.8828 | 1303.0794 | 1318.9505 |
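Statistics of this kind can be obtained with coda after pooling the converged chains (a sketch, assuming `chains` is the mcmc.list from the fitted model):

```r
library(coda)

# stack the draws of all chains into a single matrix, then back to an mcmc object
pooled <- as.mcmc(do.call(rbind, lapply(chains, as.matrix)))

summary(pooled)                   # posterior means and equi-tailed quantile intervals
HPDinterval(pooled, prob = 0.95)  # highest posterior density intervals
```

Pooling is legitimate here precisely because the convergence diagnostics above indicate that the chains target the same stationary distribution.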

Model Evaluation

Now we have everything necessary to evaluate the performance of our model. We have “trained” the model on \(80\%\) of the observations of the original dataset; let's predict the result for the test set and compare it with the real output.

x1_test<-dataset_test$fixed.acidity
x2_test<-dataset_test$volatile.acidity
x3_test<-dataset_test$citric.acid
x4_test<-dataset_test$residual.sugar
x5_test<-dataset_test$chlorides
x6_test<-dataset_test$free.sulfur.dioxide
x7_test<-dataset_test$total.sulfur.dioxide
x8_test<-dataset_test$density
x9_test<-dataset_test$pH
x10_test<-dataset_test$sulphates
x11_test<-dataset_test$alcohol

y_test<-convertion_vec(dataset_test$quality)
y_test<-as.vector(y_test)

linear_pred_test<-beta.hat.jags[1]*x1_test+beta.hat.jags[4]*x2_test+
             beta.hat.jags[5]*x3_test+beta.hat.jags[6]*x5_test+
             beta.hat.jags[7]*x6_test+beta.hat.jags[8]*x7_test+
             beta.hat.jags[9]*x8_test+beta.hat.jags[2]*x10_test+
             beta.hat.jags[3]*x11_test

prob_predict_test<-exp(linear_pred_test)/(1+exp(linear_pred_test))

y_pred<-rbinom(nrow(dataset_test),1,prob_predict_test)

accuracy<-mean(y_test==y_pred)

The accuracy reached is 0.65! This is a decent result for such a simple classifier as logistic regression; we could try to improve the performance, for example with a more complex model (maybe a nonlinear one), with different combinations of the features, or with different priors for the parameters.
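Note that drawing `y_pred` from `rbinom` makes the reported accuracy itself random. A deterministic alternative (a common choice, not the rule used above) is to threshold the predicted probabilities at 0.5:

```r
# classify a wine as good whenever its predicted probability exceeds 0.5
y_pred_det <- as.integer(prob_predict_test > 0.5)
accuracy_det <- mean(y_test == y_pred_det)
```

With thresholding, repeated evaluations on the same test set always return the same accuracy.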

References

Brixius, Nathan. “The Logit and Sigmoid Functions.” https://nathanbrixius.wordpress.com/2016/06/04/functions-i-have-known-logit-and-sigmoid/.
Cortez, Paulo, Fernando Almeida, and Antonio Cerdeira. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” http://www3.dsi.uminho.pt/pcortez/wine5.pdf.
Kaggle. “Wine Quality Classification, Data Set for Binary Classification.” https://www.kaggle.com/nareshbhat/wine-quality-binary-classification.
Lam, Patrick. “Convergence Diagnostics.” http://patricklam.org/teaching/convergence_print.pdf.
RDocumentation. “Gelman and Rubin's Convergence Diagnostic.” https://www.rdocumentation.org/packages/coda/versions/0.19-4/topics/gelman.diag.
———. “Geweke's Convergence Diagnostic.” https://www.rdocumentation.org/packages/coda/versions/0.19-4/topics/geweke.diag.
———. “Heidelberger and Welch's Convergence Diagnostic.” https://www.rdocumentation.org/packages/coda/versions/0.19-4/topics/heidel.diag.
UCI Machine Learning Repository. “Wine Quality Data Set.” https://archive.ics.uci.edu/ml/datasets/wine+quality.